Improving the Performance of Text Categorization using N-gram Kernels

Authors

  • Santhosh Kumar
  • Reghu Raj
Abstract

Kernel methods are known for their robustness in handling large feature spaces and are widely used as an alternative to methods based on external feature extraction in tasks such as classification and regression. This work applies different string kernels, such as n-gram kernels and gappy n-gram kernels, to text classification. It studies how kernel concatenation and feature combination affect the classification accuracy of the system, and it explores how kernel combination algorithms perform on the system. The kernels are implemented as rational kernels, which satisfy Mercer's theorem, ensuring that the kernel matrices are positive definite symmetric. The rational kernels are computed with a general algorithm based on the composition of weighted transducers, which helps in dealing with variable-length sequences. These kernels are then used with an SVM to formulate an efficient classifier for text categorization. Both one-stage and two-stage algorithms are applied for kernel combination, and both achieved better system performance than the individual kernels.
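The kernels named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it counts features explicitly instead of composing weighted transducers, it assumes word-level tokens, and it combines kernels with fixed uniform weights rather than a learned one-stage or two-stage algorithm. The function names, the `decay` gap penalty, and the toy documents are all hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_counts(tokens, n):
    """Count contiguous n-grams in a token sequence (list of words or a string)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def gappy_bigram_counts(tokens, max_gap=1, decay=0.5):
    """Count token pairs with up to max_gap skipped tokens between them.

    Each skipped token multiplies the pair's weight by `decay`. This penalty
    is an assumption of the sketch; the abstract does not specify one.
    """
    counts = Counter()
    for i in range(len(tokens)):
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j < len(tokens):
                counts[(tokens[i], tokens[j])] += decay ** gap
    return counts


def count_kernel(cx, cy):
    """Inner product of two sparse count vectors."""
    return sum(v * cy[k] for k, v in cx.items() if k in cy)


def gram_matrix(docs, featurize):
    """Kernel (Gram) matrix of all document pairs under one feature map."""
    feats = [featurize(d) for d in docs]
    return [[count_kernel(a, b) for b in feats] for a in feats]


def normalize(K):
    """Scale K so that every non-empty document has unit self-similarity."""
    d = [sqrt(K[i][i]) for i in range(len(K))]
    return [[K[i][j] / (d[i] * d[j]) if d[i] and d[j] else 0.0
             for j in range(len(K))] for i in range(len(K))]


def combine(kernels, weights):
    """Weighted sum of kernel matrices; non-negative weights keep it PDS."""
    size = len(kernels[0])
    return [[sum(w * K[i][j] for K, w in zip(kernels, weights))
             for j in range(size)] for i in range(size)]


# Toy corpus (hypothetical), tokenized on whitespace.
docs = [text.split() for text in (
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices fell sharply",
)]
K_ngram = normalize(gram_matrix(docs, lambda d: ngram_counts(d, 2)))
K_gappy = normalize(gram_matrix(docs, gappy_bigram_counts))
K_combined = combine([K_ngram, K_gappy], [0.5, 0.5])
```

A matrix such as `K_combined` could then be handed to an SVM that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`. Explicit counting is used here only for readability; the transducer-based computation described in the abstract reaches the same kind of kernel value without enumerating features per document.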

Similar resources

Can characters reveal your native language? A language-independent approach to native language identification

A common approach in text mining tasks such as text categorization, authorship identification or plagiarism detection is to rely on features like words, part-of-speech tags, stems, or some other high-level linguistic features. In this work, an approach that uses character n-grams as features is proposed for the task of native language identification. Instead of doing standard feature selection,...

A Study Using n-gram Features for Text Categorization

In this paper, we study the effect of using n-grams (sequences of words of length n) for text categorization. We use an efficient algorithm for generating such n-gram features in two benchmark domains, the 20 newsgroups data set and 21,578 REUTERS newswire articles. Our results with the rule learning algorithm RIPPER indicate that, after the removal of stop words, word sequences of length 2 or...

Text Categorization Techniques for Intrusion Detection -- A N-Gram-Based Method

Text categorization techniques were used for anomaly intrusion detection by Liao and Vemuri in their USENIX '02 paper [1]. Another n-gram-based text categorization method, proposed in this report, is expected to improve the performance of intrusion detection systems that implement Liao's method.

Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA

With the explosive growth in the amount of information, tools and methods are needed to search, filter, and manage resources. One of the major problems in text classification relates to high-dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...

A Comparison of Text-Categorization Methods Applied to N-Gram Frequency Statistics

This paper analyzes multi-class e-mail categorization performance, comparing a character n-gram document representation against a word-frequency-based representation. Furthermore, the impact of using available e-mail-specific meta-information on classification performance is explored, and the findings are presented.

Publication date: 2015